10 research outputs found

    Mapping for maximum performance on FPGA DSP blocks

    The digital signal processing (DSP) blocks on modern field programmable gate arrays (FPGAs) are highly capable and support a variety of different datapath configurations. Unfortunately, inference in synthesis tools can fail to produce circuits that reach maximum DSP block throughput. We have developed a tool that maps graphs of add/sub/mult nodes to DSP blocks on Xilinx FPGAs, ensuring maximum throughput. This is done by delaying scheduling until after the graph has been partitioned onto DSP blocks, then scheduling based on their pipeline structure, resulting in a throughput-optimized implementation. Our tool also prepares equivalent implementations using a variety of other methods, including high-level synthesis (HLS), for comparison. We show that the proposed approach offers an improvement in frequency of 100% over standard pipelined code, and 23% over a Vivado HLS synthesis implementation, while retaining code portability, at the cost of a modest increase in logic resource usage.

    Multipumping flexible DSP blocks for resource reduction on Xilinx FPGAs

    For complex datapaths, resource sharing can help reduce area consumption. Traditionally, resource sharing is applied when the same resource can be scheduled for different uses in different cycles, often resulting in a longer schedule. Multipumping is a method whereby a resource is clocked at a frequency that is a multiple of that of the surrounding circuit, thereby offering multiple executions per global clock cycle. This allows a single resource to be shared among multiple uses in the same cycle. This concept maps well to modern field-programmable gate arrays (FPGAs), where hard macro blocks are typically capable of running at higher frequencies than most designs implemented in the logic fabric. While this technique has been demonstrated for static resources, modern digital signal processing (DSP) blocks are flexible, supporting varied operations at runtime. In this paper, we demonstrate multipumping for resource sharing of the flexible DSP48E1 macros in Xilinx FPGAs. We exploit their dynamic programmability to enable resource sharing for the full set of supported DSP block operations, and compare this to multipumping only multipliers and DSP blocks with fixed configurations. The proposed approach saves, on average, 48% of DSP blocks at a cost of 74% more LUTs, effectively saving 30% of equivalent LUT area, and is feasible for the majority of designs, in which the clock frequency is typically below half the maximum supported by the DSP blocks.
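    The multipumping idea described in this abstract can be sketched as a small behavioural model. This is an illustration under stated assumptions, not the paper's implementation: one shared unit runs `pump_factor` fast-clock ticks per system clock cycle, and it may perform a different operation on each tick, which stands in for the dynamic programmability of the DSP block. The names `units_needed` and `run_system_cycle` are hypothetical.

```python
import math
import operator
from typing import Callable, List, Tuple

# Hypothetical operation record: (function, operand a, operand b).
Op = Tuple[Callable[[int, int], int], int, int]


def units_needed(concurrent_ops: int, pump_factor: int) -> int:
    """Units required when each unit executes pump_factor ops per system cycle."""
    return math.ceil(concurrent_ops / pump_factor)


def run_system_cycle(ops: List[Op], pump_factor: int) -> List[int]:
    """Execute one system cycle's operations on a single multipumped unit."""
    if len(ops) > pump_factor:
        raise ValueError("more operations than fast-clock ticks in one system cycle")
    results = []
    for tick in range(pump_factor):
        if tick < len(ops):
            fn, a, b = ops[tick]  # the unit is reconfigured on every fast tick
            results.append(fn(a, b))
    return results


# Two different operations share one unit within the same system cycle.
shared = run_system_cycle([(operator.mul, 3, 4), (operator.add, 5, 6)], pump_factor=2)
```

    Here `shared` holds both results from a single unit, and `units_needed(4, 2)` is 2 where four dedicated units would otherwise be required. The model ignores the extra logic needed to multiplex inputs and cross clock domains, which is where the additional LUT cost reported in the abstract would arise.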

    Minimizing DSP block usage through multi-pumping

    Resource sharing in the mapping of an algorithm to an architecture allows the same resource to be scheduled for different uses in different cycles, generally at the cost of increased schedule length. Multi-pumping is a method whereby a resource is clocked at a frequency that is a multiple of that of the surrounding circuit, thereby offering multiple executions per global clock cycle, and therefore sharing within the same clock cycle. This concept maps well to FPGA architectures, where hard macro blocks are typically capable of running at higher frequencies than standard logic. While this technique has been demonstrated for multipliers, modern DSP blocks are more complex, with multiple computational nodes. In this paper, we apply multi-pumping to minimise DSP block usage, while taking advantage of the multiple nodes they support. The proposed approach uses, on average, 39% fewer DSP blocks, at a cost of 19% more LUTs and 7% more registers.

    Minimising DSP block usage through multi-pumping

    Resource sharing in the mapping of an algorithm to an architecture allows the same resource to be scheduled for different uses in different cycles, generally at the cost of increased schedule length. Multi-pumping is a method whereby a resource is clocked at a frequency that is a multiple of that of the surrounding circuit, thereby offering multiple executions per global clock cycle, and therefore sharing within the same clock cycle. This concept maps well to FPGA architectures, where hard macro blocks are typically capable of running at higher frequencies than standard logic. While this technique has been demonstrated for multipliers, modern DSP blocks are more complex, with multiple computational nodes. In this paper, we apply multi-pumping to minimise DSP block usage, while taking advantage of the multiple nodes they support. The proposed approach uses, on average, 39% fewer DSP blocks, at a cost of 19% more LUTs and 7% more registers.

    Mapping for Maximum Performance on FPGA DSP Blocks

    Exploiting DSP block capabilities in FPGA high level design flows

    The embedded DSP blocks in modern Field Programmable Gate Arrays (FPGAs) are highly capable and support a variety of different data path configurations. These evolved to support a range of applications requiring significant amounts of fast arithmetic. In addition to all the computational capabilities, DSP blocks support runtime dynamic programmability, which allows a single DSP block to be used as a different computational block in every clock cycle. Vendor synthesis tools can infer the use of these resources but they do not exploit their full capabilities, especially the dynamic configuration. Specific language structures are suggested for implementing standard applications but others that do not fit these standard designs can suffer from inefficient mapping. High-level synthesis (HLS) tools rely on the backend synthesis tools to map efficiently to the target architecture. This thesis explores how DSP blocks can be exploited to produce high throughput computational kernels close to the theoretical limit of the primitives, and how their dynamic programmability can be exploited to create efficient implementations. We show that this can be achieved using a high level description, but only by considering architectural information at higher levels. An automated tool flow is presented that takes a high-level description of a computational kernel in C and generates synthesisable Verilog that achieves performance close to theoretical limits of the DSP block with hand-optimised designs. We extend this tool to support proposed techniques for resource sharing of DSP blocks, adapting traditional approaches for the high latency of the DSP blocks, and also applying multi-pumping in this new context. This detailed design results in circuits that always operate at close to the theoretical limits, and offer full utilisation of the DSP block.
    Doctor of Philosophy (SCE)

    Evaluating the efficiency of DSP Block synthesis inference from flow graphs

    The embedded DSP Blocks in FPGAs have become significantly more capable in recent generations of devices. While vendor synthesis tools can infer the use of these resources, the efficiency of this inference is not guaranteed. Specific language structures are suggested for implementing standard applications but others that do not fit these standard designs can suffer from inefficient synthesis inference. In this paper, we demonstrate this effect by synthesising a number of arithmetic circuits, showing that standard code results in a significant resource and timing overhead compared to considered use of DSP Blocks and their plethora of configuration options through custom instantiation.

    Is FPGA Useful for Hash Joins?

    10th Annual Conference on Innovative Data Systems Research (CIDR ‘20), P2

    On-the-fly parallel data shuffling for graph processing on OpenCL-based FPGAs

    10.1109/FPL.2019.00020; 29th International Conference on Field Programmable Logic and Applications (FPL); 67-7